Graph kernels for chemical informatics
Abstract
The increased availability of large repositories of chemical compounds is creating new challenges and opportunities for applying machine learning methods to problems in computational chemistry and chemical informatics. Because chemical compounds are often represented by the graph of their covalent bonds, machine learning methods in this domain must be capable of processing graph structures of variable size. Here, we first briefly review the literature on graph kernels and then introduce three new kernels (Tanimoto, MinMax, Hybrid) based on the idea of molecular fingerprints and on counting labeled paths of depth up to d using depth-first search from each possible vertex. The kernels are applied to three classification problems, predicting mutagenicity, toxicity, and anti-cancer activity on three publicly available data sets. The kernels achieve performance at least comparable, and most often superior, to that previously reported in the literature, reaching accuracies of 91.5% on the Mutag dataset, 65-67% on the PTC (Predictive Toxicology Challenge) dataset, and 72% on the NCI (National Cancer Institute) dataset. Properties and tradeoffs of these kernels, as well as other proposed kernels that leverage 1D or 3D representations of molecules, are briefly discussed.
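The path-counting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that paths are simple (no atom is visited twice), that depth is measured in bonds, and that the Tanimoto and MinMax kernels take their usual forms: shared over total distinct paths for Tanimoto, and the sum of minimum counts over the sum of maximum counts for MinMax. The Hybrid kernel is left out because the abstract does not define it. The names `path_counts`, `tanimoto_kernel`, and `minmax_kernel`, and the atom/bond input format, are hypothetical.

```python
from collections import Counter


def path_counts(atoms, bonds, d):
    """Count labeled simple paths with at most d bonds, via DFS from every atom.

    atoms: list of atom labels, e.g. ["C", "C", "O"]
    bonds: dict mapping (i, j) atom-index pairs to a bond label, e.g. {(0, 1): "-"}
    (This input format is an assumption made for illustration.)
    """
    # Build an undirected adjacency list that keeps the bond labels.
    adj = {i: [] for i in range(len(atoms))}
    for (i, j), bond in bonds.items():
        adj[i].append((j, bond))
        adj[j].append((i, bond))

    counts = Counter()

    def dfs(v, label, visited, depth):
        counts[label] += 1  # every DFS prefix is itself a labeled path
        if depth == d:
            return
        for w, bond in adj[v]:
            if w not in visited:  # assumption: no atom is revisited
                dfs(w, label + (bond, atoms[w]), visited | {w}, depth + 1)

    for start in range(len(atoms)):
        dfs(start, (atoms[start],), {start}, 0)
    return counts


def tanimoto_kernel(cu, cv):
    """Shared distinct paths divided by all distinct paths (binary features)."""
    fu, fv = set(cu), set(cv)
    union = len(fu | fv)
    return len(fu & fv) / union if union else 0.0


def minmax_kernel(cu, cv):
    """Sum of per-path minimum counts divided by sum of per-path maximum counts."""
    keys = set(cu) | set(cv)
    num = sum(min(cu[p], cv[p]) for p in keys)  # Counter returns 0 if p is absent
    den = sum(max(cu[p], cv[p]) for p in keys)
    return num / den if den else 0.0


# Heavy-atom graphs of ethanol (C-C-O) and dimethyl ether (C-O-C).
ethanol = path_counts(["C", "C", "O"], {(0, 1): "-", (1, 2): "-"}, d=2)
ether = path_counts(["C", "O", "C"], {(0, 1): "-", (1, 2): "-"}, d=2)
print(tanimoto_kernel(ethanol, ether), minmax_kernel(ethanol, ether))
```

Tanimoto looks only at whether a path occurs, while MinMax also uses how often it occurs, so the two can rank the same pair of molecules differently. A kernel matrix built from either function could then be passed to a kernel classifier such as a support vector machine.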
Related resources
Graph Kernels for Chemoinformatics
In chemoinformatics and bioinformatics, it is effective to automatically predict the properties of chemical compounds and proteins with computer-aided methods, since this can substantially reduce the costs of research and development by screening out unlikely compounds and proteins from the candidates for "wet" experiments. Data-driven predictive modeling is one of the main research topics in che...
An Application of Boosting to Graph Classification
This paper presents an application of Boosting for classifying labeled graphs, general structures for modeling many kinds of real-world data, such as chemical compounds, natural language texts, and bio sequences. The proposal consists of i) decision stumps that use subgraphs as features, and ii) a Boosting algorithm in which subgraph-based decision stumps are used as weak learners. We also discuss...
A Netflow Distance between Labeled Graphs: Applications in Chemoinformatics
We propose a novel measure of similarity between labeled graphs which has applications to structured data analysis, such as chemical informatics and web document clustering. Exact metrics on graphs based on subgraph isomorphism have been proposed earlier, but for lack of an efficient algorithm they cannot be applied to large data sets. Our metric on graphs exploits vertex context sim...
On the Zagreb and Eccentricity Coindices of Graph Products
The second Zagreb coindex is a well-known graph invariant defined as the total degree product of all non-adjacent vertex pairs in a graph. The second Zagreb eccentricity coindex is defined analogously to the second Zagreb coindex by replacing the vertex degrees with the vertex eccentricities. In this paper, we present exact expressions or sharp lower bounds for the second Zagreb eccentricity co...
An Efficient Sampling Scheme For Comparison of Large Graphs
As new graph-structured data is being generated, graph comparison has become an important and challenging problem in application areas such as molecular biology, telecommunications, chemoinformatics, and social networks. Graph kernels have recently been proposed as a theoretically sound approach to this problem, and have been shown to achieve high accuracies on benchmark datasets. Different gra...
Journal title:
- Neural Networks: the official journal of the International Neural Network Society
Volume 18, Issue 8
Pages: -
Publication date: 2005